 rack type


Rack Position Optimization in Large-Scale Heterogeneous Data Centers

arXiv.org Artificial Intelligence

As rapidly growing AI computational demands accelerate the need for new hardware installation and maintenance, this work explores optimal data center resource management by balancing operational efficiency with fault tolerance through strategic rack positioning that accounts for diverse resources and locations. Traditional mixed-integer programming (MIP) approaches often struggle with scalability, while heuristic methods may result in significant sub-optimality. To address these issues, this paper presents a novel two-tier optimization framework in which a high-level deep reinforcement learning (DRL) model guides a low-level gradient-based heuristic for local search. The high-level DRL agent employs Leader Reward to choose the rack type ordering, and the low-level heuristic efficiently maps racks to positions, minimizing movement counts and ensuring fault-tolerant resource distribution. This approach scales to over 100,000 positions and 100 rack types. Our method outperformed the gradient-based heuristic by 7% on average and the MIP solver by over 30% in objective value. It achieved a 100% success rate versus MIP's 97.5% (within a 20-minute limit), completing in just 2 minutes compared to MIP's 1630 minutes (roughly 800 times faster, i.e., almost 3 orders of magnitude improvement). Unlike the MIP solver, which showed performance variability under time constraints and high penalties, our algorithm consistently delivered stable, efficient results, an essential feature for large-scale data center management.
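To make the two-tier structure described in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's method: a random search over rack type orderings stands in for the DRL agent, and a simple greedy rule stands in for the gradient-based heuristic. The function names (`greedy_place`, `search_orderings`), the zone model, and the penalty weight of 10 are all assumptions made for illustration, and the sketch assumes total demand does not exceed the number of positions.

```python
import random
from collections import Counter


def greedy_place(order, demand, current, zone_of):
    """Low-level stand-in: place rack types in the given order.

    current[p] is the rack type now at position p (or None if empty).
    zone_of[p] is a hypothetical fault zone for position p.
    Returns (assignment, cost), where cost = moves + weighted imbalance.
    """
    free = set(range(len(current)))
    assignment = [None] * len(current)
    zones = sorted(set(zone_of))
    spread = {t: Counter() for t in demand}  # racks of type t per zone
    moves = 0
    for t in order:
        for _ in range(demand[t]):
            # Prefer the least-loaded zone for this type, then a position
            # that already holds a rack of type t (so no move is needed).
            p = min(free, key=lambda q: (spread[t][zone_of[q]],
                                         current[q] != t, q))
            free.remove(p)
            assignment[p] = t
            spread[t][zone_of[p]] += 1
            moves += current[p] != t
    # Imbalance: gap between the fullest and emptiest zone, per rack type.
    imbalance = sum(max(c[z] for z in zones) - min(c[z] for z in zones)
                    for c in spread.values())
    return assignment, moves + 10 * imbalance  # 10 is an arbitrary weight


def search_orderings(demand, current, zone_of, trials=50, seed=0):
    """High-level stand-in: random search over rack type orderings."""
    rng = random.Random(seed)
    types = sorted(demand)
    best = None
    for trial in range(trials):
        order = types[:]
        if trial > 0:  # the first candidate is the plain sorted order
            rng.shuffle(order)
        assignment, cost = greedy_place(order, demand, current, zone_of)
        if best is None or cost < best[1]:
            best = (assignment, cost, order)
    return best


if __name__ == "__main__":
    demand = {"gpu": 2, "storage": 2}
    current = ["gpu", "storage", None, None, "gpu", None]
    zone_of = [0, 0, 0, 1, 1, 1]
    assignment, cost, order = search_orderings(demand, current, zone_of)
    print(order, assignment, cost)
```

The point of the split is that the outer search only has to explore orderings of rack types (at most about 100 in the reported experiments), while the inner routine handles the far larger space of positions.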


Facebook's Expanding Machine Learning Infrastructure

#artificialintelligence

Here at The Next Platform, we tend to keep a close eye on how the major hyperscalers evolve their infrastructure to support massive scale and ever more complex workloads. Not so long ago the core services were relatively standard transactions and operations, but with the addition of training and inferencing against complex deep learning models (something that requires a two-handed approach to hardware), the hyperscale hardware stack has had to quicken its step to keep pace with the new performance and efficiency demands of machine learning at scale. While not innovating on the custom hardware side in quite the same way as Google, Facebook has shared some notable progress in fine-tuning its own datacenters. From its unique split network backbone and neural network-based visualization system to large-scale upgrades to its server farms and its work honing GPU use, there is plenty to focus on infrastructure-wise. For us, one of the more prescient developments from Facebook is its own server designs, which served over 2 billion accounts as of the end of 2017, specifically its latest GPU-packed, Open Compute-based approach.